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Abstract 

We use Levy processes to generate joint prior distributions, and therefore penalty 
functions, for a location parameter j5 — (J3i , , j8p) as p grows large. This gener- 
alizes the class of local-global shrinkage rules based on scale mixtures of normals, 
illuminates new connections among disparate methods, and leads to new results for 
computing posterior means and modes under a wide class of priors. We extend this 
framework to large-scale regularized regression problems where p > n, and provide 
comparisons with other methodologies. 

Keywords: Levy processes; normal scale mixtures; shrinkage; sparsity; PCR; PLS. 



1 Introduction 

In recent years there has been considerable interest in the subject of choosing a joint prior 
distribution, or equivalently a penalty function, for a high-dimensional location vector 
P = (jSi, . . . ,j8p). Much of this work has been motivated by problems in high-dimensional 
regularized regression, where we observe data y = X/3 + £ and wish to estimate j8. Good 
examples of recent Bayesian research in this area are the papers of Park and Casel"ia| ( |2008 



and Hans| (2009 1, who explore Bayesian versions of the traditional lasso penalty. 



In the case where the number of regressors is moderate, the use of normal variance 
mixtures to generate exchangeable joint distributions for j8 has been studied in detail. Such 
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priors arise from a hierarchical model where 



(Pj\*,Xf) ~ N(0,t^/) (1) 
Xj ~ />(A 2 ) (2) 
(r 2 ,a 2 ) ~ P (T 2 ,a 2 ), (3) 

with the Xj's known as the local shrinkage parameters. The following section reviews 
several recent proposals along these lines. 

The analogous classical formulation is to identify a penalty function with the log prior: 

g(Pj) = -\0gp(Pj\T 2 ) 

poo 

p(I5j\t 2 ) = J o P ^ j \x\X 2 )p{X))dX). 
Upon observing data y, j3 is chosen to minimize 

/( J 8) = ||y-X i 8|| 2 + v£g( i 8 ; -), (4) 

where T 2 has been re-expressed in terms of the regularization parameter v. The minimizing 
choice of j3 is equivalent to the joint posterior mode under the Bayesian formulation. 

Our interest in this framework arises from the intersection of two challenges, one theo- 
retical and the other practical. 

1 . The usual variance-mixture approach provides an insufficiently general understand- 
ing of how priors, penalty functions, and Bayesian variable selection are related. For 



example, some penalty functions lack variance-mixture equivalents (e.g. Fan and Li 



2001), while some variance mixtures lead to penalty functions without closed-form 



representations (e.g. Carvalho et al. 2010 1. Meanwhile, Bayesian variable selection, 



involving a complex prior and a large discrete space of submodels, would seem a 
world apart from either approach. 

2. As p grows large, many potentially useful approaches become computationally in- 
tractable, or yield answers whose quality is difficult to assess using standard tools. 

In this paper, we explore an alternative to the traditional framework by constructing 
joint priors for j8 using Levy processes. This provides a unifying probabilistic structure for 
penalized regression and variable selection from both Bayesian and classical viewpoints. 

Even within this new framework, a complete theory capable of completely solving both 
challenges listed above remains beyond our reach. Nonetheless, we will argue that the 
Levy-process view offers several insights that are highly relevant to statistical practice. 
First, our approach embeds finite-dimensional normal variance mixtures in a wider class 
of infinite-dimensional, non-Gaussian joint distributions. It therefore provides an intuitive 
framework for asymptotic analysis on existing priors and penalties, as well as a device 
for generating previously unexplored options (such as the Meixner and z-distributions dis- 
cussed in Section [3]>. The use of Levy processes in high-dimensional Bayesian modeling 
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has been gaining in popularity (e.g. |Wolpert and Taqqu] |2005| ; Wolper t et al.| |2010] l. Our 
approach differs from this line of work, in that we wish to use the theory of Levy processes 
to provide a general framework of penalty functions, shrinkage priors with exchangeable 
structure, and the relationship between them. Sections [3] and [4] will explore these relation- 
ships in depth, while Section [5] will demonstrate their statistical relevance. 

Second, we show that both Bayesian variable selection and the pure-shrinkage approach 
of something like the lasso can be subsumed into a unified theoretical framework. Connec- 
tions between these two approaches are important due to the acute computational difficul- 
ties associated with the high-dimensional variable-selection problem. Indeed, Section [5] 
describes an asymptotic sense in which the two models agree on certain important features. 

Finally, our framework provides new insight on how the two quantities most typically 
of interest — the posterior mode and mean of j8 — can be computed. We prove a theorem 
characterizing the posterior mean for /3 in terms of the Levy measure of the subordinator 
used to construct the joint prior /?(/?). This theorem can be used to understand the issue of 
Bayesian robustness for a much wider class of priors than those for which existing tools are 
sufficient (c.f. Per icchi and Smith] |1992| [Griffin and Brown 2010 1. We also show how the 
Levy-process approach leads to a simple mode-finding algorithm, analogous to the local 
linear approximation (LLA) of |Zou and Li ( 2008 ). 

Sections [6] and [7] illustrate applications of the approach in high-dimensional regression 
problems, including those where p > n, by placing local shrinkage priors on certain linear 
combinations of the j8/s. These linear combinations are given by the right-singular vectors 



of the design matrix. Our approach therefore builds upon the work of Frank and Fried- 



man 



( fl993] ), |ayde et al.| ( [T996] ), |Denison and George| ( p000l ), |Westl ( p003] ), and |Maruyama 



and George] (2010). These authors provide a unified framework for ridge regression (RR), 



principal-component regression (PCR), partial least-squares (PLS), the g-prior, and gener- 
alized g-priors. We generalize this framework still further by combining it with the idea of 
using local-shrinkage priors derived from Levy processes. 



2 Local shrinkage priors 



The class of joint priors jc(j8) based on exchangeable normal variance mixtures ([T]]3]) in- 
cludes widely known forms such as the t and the double-exponential, along with some of 
the following, more recent proposals. 



Normal/Jeffreys, where p(f$j) °= 1/3,1 (Figueiredo 



2003 



Bae and Mallick 



2004). It 



arises from placing Jeffreys' prior upon each local variance: p(Xf) °< 1/hf. 

Normal/exponential-gamma, where Xj ~ Exp(r), and where there is a second-level Ga(c, 1 ) 
prior for the exponential rate parameter r (Griffin and Brown 2005 ). Marginally, this 
gives p{Xf) - (i+A, 2 )-^. 

Normal/gamma and normal/inverse-Gaussian, where the local variances receive gamma 
or inverse-Gaussian mixing densities ( |Caron and Doucet 2008 \ |Griffin and Brown] 
20T01. 
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Horseshoe prior, a special case of a normal/inverted-beta class, where Xf ~ lB(a,b) has 
an inverted-beta distribution ( |Carvalho et al.[|2010"t|Polson and Scott] |2010| ). 



Generalized double-Pareto, which has a Laplace-like spike at zero and polynomial tails 



(Armagan et al. 2010). 



Full posterior inference under these priors can be viewed as a Bayesian analogue of penalized- 
likelihood estimation. For a more extensive bibliography, see Poison and Scott ( |2011 ). 



These priors are typically used when j8 is expected to be sparse. A natural question is: 
why should Bayesians consider such an approach to a sparse problem, when these local- 
shrinkage priors do not explicitly allow for the possibility that some of the j3/s are zero 
with positive prior probability? At least three reasons suggest themselves. 

First, suppose that one proceeds in the traditional Bayesian way, by averaging over 
different submodels in proportion to their posterior probabilities. These model-averaged 
coefficients will be nonzero with probability 1 under the sampling distribution for y, re- 
gardless of j6 , and hence may be practically indistinguishable from the posterior mean of jS 
a carefully chosen shrinkage prior. 

Second, many Bayesians oppose testing point null hypotheses, and would rather shrink 
than select, on the grounds that point nulls are unrealistic. Sparse shrinkage priors offer a 
compromise. They discount the possibility that /$y = 0, yet they sift signals from noise more 
aggressively than a traditional elliptical prior. 

Finally, the pure-shrinkage answer can offer computational gains over Bayesian model 
averaging. For a normal linear model with conjugate priors, the difference may be small. 
But for cases where marginal likelihoods of different regression hypotheses cannot be com- 
puted in closed form, the difference may be substantial, and the shrinkage approach can be 
used to approximate the model-averaged solution. 

To illustrate this third argument, we simulated data from a probit model with p = 25 
and n = 500: 

yi = ht>0 for i= l,...,n 
z ~ N(X/3,/), 

where /3 contained 20 zeros along with 5 nonzero entries, all equal to v5 — a so-called 
"r-spike signal" with r = 5 and ||j8 1| 2 = p. The rows of X were simulated from a multi- 
variate normal distribution whose covariance matrix was drawn from an inverse-Wishart 
distribution, centered at I p and with p + 2 degrees of freedom. 

We simulated 100 data sets from this model, and compared four approaches for esti- 
mating j6 using the probit link function: (1) maximum likelihood, using the glm function 



in R; (2) lasso-CT, using the lasso penalty and choosing v = \/21og p as in Candes and 



Tao (2007[); (3) lasso-CV, with v chosen by cross-validation; and (4) HS, the horseshoe 



posterior-mean estimator ( Carvalho et al.| 2010] ), a recent example of a pure- shrinkage ap- 



proach designed to estimate sparse signals. We measured accuracy in estimating j3 by 
squared-error loss. Table[T]shows the median and mean sum of squared errors realized over 
the 100 simulations. The pure- shrinkage Bayesian model outperformed the alternatives by 
a wide margin. 
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Table 1 : Median and mean sum of squared errors in reconstructing the probit r-spike signal 
in 100 simulated data sets. 



MLE Lasso-CT Lasso-CV HS 
Median SSE 191) 153 123 07 

MeanSSE 68.6 15.4 11.7 1.6 



Bayesian model averaging would be difficult here: the marginal likelihood for a given 
submodel cannot be computed in closed form, even assuming a conditionally conjugate 
prior for j8. Either high-dimensional numerical integration or a Laplace approximation 
must be used instead. By contrast, a pure-shrinkage model is no harder to fit for binary data 
than it is for continuous data, using the simple trick of data augmentation. 

This example motivates the question of how one should choose a prior 7i(Xj), or equiv- 
alently a penalty function, since different choices can lead to large differences in perfor- 
mance. The oracle property provides a unifying framework for evaluating procedures un- 
der a classical framework; many different criteria have been proposed for accomplishing the 
same goal under a Bayesian framework. To our knowledge only the lasso has been studied 
extensively under both paradigms. 

One interesting question is: how can we translate between the Bayesian and penalized- 
likelihood formulations? In the following section, we use the theory of Levy processes to 
establish a series of three (successively more general) characterizations of shrinkage priors 
and their relationship with penalty functions. 



3 Priors and penalties from Levy processes 

3.1 Normal variance mixtures and subordinated Brownian motion 



Our goal is to provide a framework in which important features of a prior for a high- 
dimensional location vector j3 can be studied in terms of the Levy measure /I (dx) of some 
Levy process. This perspective gives applied modelers a large toolbox for constructing 
prior distributions or penalty functions with specific desired properties. 

This approach is most readily introduced via the special case of ([TJ-Q studied by |Caron| 
and Doucet (2008 ) and Griffin and Brown (2010), where the normal-gamma prior for j8 is 



seen to be the finite-dimensional marginal distribution of a variance-gamma process. 

Let T(s) be a standard gamma process having marginal distribution T(s) ~ Ga(s, 1] 
time s > 0. Because the gamma distribution is self-similar, for any value of p 



at 



r(v) = LA ; 2 

7=1 
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if (Xj | v) ~ Ga(v/p, 1), or equivalently if the Xj's are identified with the increments of T: 

As p diverges, one may identify each local variance Xj with precisely one of the countable 
jumps in the sample path of the gamma process. A tangential but interesting fact is that, if 
we were to normalize the Xj's by their sum T(v), we would obtain the joint distribution for 



the weights in a Dirichlet-process mixture model (Kingman 1975 ). 



The gamma process is just one example of a subordinator, or a one-dimensional Levy 
process that is nondecreasing with probability 1. If T(s) is a subordinator and W(s) is a 
standard Wiener process, then the Levy process Z{s) = W{T(s)} is an example of subor- 
dinated Brownian motion observed on a random irregular time scale, a construction first 
explored by S. Bochner in the 1950's. The increments of T yield the local variances {A?}, 
while the increments of Z give us the regression coefficients {fij}. When T is a gamma 
subordinator, Z is called a variance-gamma process. 

Subordinated Brownian motion is the natural infinite-dimensional generalization of a 
normal variance mixture. Specifying the subordinator is equivalent to specifying the mixing 
measure p(Xj). 

In this way, one may define a joint distribution for /3 by way of a single quantity: the 
marginal distribution of a subordinator T at time s = v. One may generate other joint 
distributions for /3 via the same device of slicing up a subordinator into its increments, 
and identifying these increments with the variances {Xj} in a conditionally normal joint 
distribution for /3. If, for example, T(v) is inverse-Gaussian, then each /3 ; will have a 
normal/inverse-Gaussian distribution (see, e.g., |Barndorff -Nielsen| 1997 1. 



An important feature of subordinators is that they are infinitely divisible. This ensures 
that our construction remains sensible even in the infinite-dimensional limit. For example, 
suppose that we identify the local variances Xj of p different j3/s with the increments of 
T, a subordinator, observed on a regular grid. This p-variate random variable can then be 
described a priori in terms of the behavior of a single random variable T, which specifies 
an easily interpretable aggregate feature of the /3 sequence — namely, the sum of the local 
variances. If we were then to consider 2p j8/s instead, but wished to retain the same aggre- 
gate features of the (now longer) j8 sequence, we must merely slice up the increments of 
the original subordinator on a finer grid. 

Self-similarity is a more restrictive but very appealing property. It will ensure that, as 
p grows and we divide the subordinator into arbitrarily fine increments, the probabilistic 
structure of the local precisions remains the same — a useful fact if one wishes to study a 
procedure's asymptotic properties. For an extensive discussion and further bibliography of 
asymptotic theory regarding Levy processes, see|Ait-Sahalia an d Jacod| ( |2009| ). 

3.2 Penalty functions and subordinators 

Not all interesting penalties can be easily interpreted in the same way as the normal-gamma. 
For example, the lasso corresponds to an exponential mixing distribution for Xj. Yet a sum 



6 



of exponentials is not itself exponential, making it difficult to interpret the lasso prior as the 
increments of subordinated Brownian motion. 

Luckily the theory of subordinators can be used in a slightly different way to obtain an 
alternative characterization of priors and penalty functions. Stated informally: all totally 
monotone penalty functions that vanish at zero correspond to priors that can be represented 
in terms of a subordinator. For certain penalties, this subordinator naturally corresponds to 
the precision, rather than the variance, of a conditionally normal prior. We present this con- 
struction in the following theorem, which provides a rich source of new penalty functions 
with explicit Bayesian formulations as mixtures of familiar distributions. Throughout the 
following discussion, we let t denote a dummy argument involving /3y. 

Theorem 1. Let yf(t), t > 0, be a nonnegative-real-valued, totally monotone function such 
that lim f ^o V(0 = 0. 

Part A: Suppose that these conditions are met for t = /(j3y). Then the prior distribution 
p(Pj | s) oc exp{— sY[f(Pj)]}, where s > 0, is the moment- generating function of a 
subordinator T(s), evaluated at f(Pj), whose Levy measure satisfies 

poo 

W(t) = / {1 -exp(-fx)} fx(dx) . (5) 
Jo 

Part B: Suppose that these conditions are met fort = fij /2. Then p(j5j \ s) °< exp{— sy/(j3?/2)}, 
where s > 0, is a mixture of normals given by 



poo 

p(Pj\s)~ j o N{Pj\0,T- l )T- l / 2 p(T)dT. 



(6) 



where p{T) is the density of the subordinator T, observed at time s, whose Levy 
measure ji. (dx) satisfies (|5]). 

As an example, consider the bridge estimator, for which log/?(j6y) = — v|j6/| a . Write 
this instead as — v(jS?/2) a / 2 , in which case the conditions of Theorem [lj are met for a € 
(0,2]. The resulting normal mixture is easily recognized as the moment-generating func- 
tion, evaluated at t = Pj/2, of a positive alpha-stable subordinator T with stability index 
a/2, observed at time s = v. This provides a very simple proof of the fact the exponential- 
power priors are normal mixtures (West} 1987 1. 



The special case of the lasso (a = 1) leads to a Stable(l/2) law for T. This is equivalent 
to an inverse-Gaussian representation of the lasso prior on the precision scale: 



r e -TPj/2 V p -*/{XT)dT 

Jo V2nT 3 



IN(0,v) ^ £lN(0,v//>). 

The inverse-Gaussian (IN) distribution, moreover, is self-similar: a sum of p inverse- 
Gaussian precision terms is still inverse-Gaussian, an analytically convenient property which 
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the lasso fails to exhibit on the Xj scale. This provides an alternative to the lasso's well- 
known characterization in terms of an exponential mixing distribution for Xj. 

Though we do not consider the point at length, Part B can be extended to the case where 
t = \fij\ b , b G (0,2], subject to further mild regularity conditions on iff- The prior p(fij) will 
be a mixture of exponential-power distributions — itself a mixture of normals, in which case 
the law of iterated expectation will be enough to establish the result. 

We can also consider a mixture or Rao-Blackwellized penalty function as follows. Sup- 
pose we define 



g(P) = -logETCvexp { - v £ Y(tj)\] , 

7=1 



where the expectation is under a prior p(v), and where C v is the normalization constant in 
p(f3 | v). Suppose that y(t) satisfies the conditions of the previous theorem and that the 
prior for v can be described in the same way by a subordinator T{s) with Levy measure 
jj,(dx). Then since T is a subordinator, its moment-generating function is 

M s {t) = E{exp(-tT{s))} = exp{-sx(t)} 

X(t) = / {l-exp(tx)}n(dx), 
Jo 

To compute the mixture penalty function, simply evaluate this moment-generating function 
for at t = £f =1 \j/(tj) to give 

*(0)=*jl>(/?)j, 

where we have absorbed a factor of C^ 1 into the implicit prior for v, to cancel with the 
normalization constant from p(f3 | v). 

Consider the example of bridge estimation with an alpha-stable prior for the regulariza- 
tion parameter. Specifically, let log p(fij \ v) = — v|j3y|, and let T be an a-stable subordina- 
tor T a (s), < a < 1, observed at time 5=1. Then \(/(t) = \ffi-, and x(t) = \t\ a . Therefore 
the mixture penalty function is 




with no nuisance parameters left to estimate. 

3.3 Nonlinear time changes and further examples 

An even more general approach for building priors from time-changed Brownian motion is 
to specify the following: 

1. a self-similar random variable z = 
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2. a transformation u mapping z.j to the positive reals. Typical examples are the identity, 
inverse, and log. 

3. Brownian motion observed at random time increments 8j = u(zj). 

This approach encompasses many other examples of time-changed Brownian motion 
not previously studied in the presence of sparsity. These examples collectively speak to 



the power and generality of the approach considered here. For example, Barndorff-Nielsen 



and Shephard ( |2001| ) study the class of normal/modified-stable processes, where the mixing 
distribution is based on exponential and power tempering (or tilting) of a positive a-stable 
subordinator. Another interesting generalisation is the Normal-Lamperti distribution with 
mixing density 



sin(7ra) 



2\a-l 



k (Xj) 2a + 2(Xj) a cos(7ia) + 1 



Xf>0. 



The transformation u accommodates cases where the mixing distribution p(Xj) is not 
obviously self-similar. The horseshoe prior of |Carvalho et al. (2010) provides an example. 
In the usual hierarchical representation of this prior, one specifies a standard half-Cauchy 
distribution for the local scales: A,- ~ C + (0, 1). This corresponds to 

an inverted- beta (or beta-prime) distribution denoted IB (1/2, 1 /2). This generalizes to the 
wider class of normal/inverted-beta mixtures (Poison and Scott 2010 1, where Xf ~ IB (a,b). 
These mixtures satisfy the weaker property of being self-decomposable: if Xf ~ IB {a, b), 
then for every < c < 1, there exists a random variable e c independent of Xf such that 
Xf = cXf + e c in distribution. 

We omit the proof of the fact that the inverted-beta distribution is self-decomposable; 
see Example 3.1 in Bondesso"n] ( 19901. The consequence of this fact is that the horseshoe 
prior can be represented directly as subordinated Brownian motion. The proof is not con- 
structive, however, as the subordinator itself is not available in closed form. The difficulty 
becomes plain upon inspecting the characteristic function of an inverted-beta distribution: 

where U (x,y,x) is a Kummer function of the second kind. A characteristic function of this 
form makes it very difficult to compute the distribution of sums of inverted-beta random 
variables. 

Representing the horseshoe prior in terms of the increments of a self-similar Levy pro- 
cess can be done straightforwardly, however, on the log- variance scale, just as a self-similar 
representation of the lasso model can be found on the precision scale. 

Suppose Xf ~ lB(a,b). Then 



X 



2 D 



1-fQ 
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where JQ ~ Be (a, b). Following 



Fisher 



( 1935 1, if zi = log{K;/(l - K{)}, then 



P(zi) 



1 {e-') a 
B(a,b) (l+e*) a+b ' 



where B(a,b) is the Beta function. More generally we may assume that z.i ~ Z(a,b,jj,,a), 
a z-distribution with density 



Pol- 



and characteristic function 



2k [exp{(z,--/f)/g}] fl 
aB(a,b) [1 + exp{( Zi - n) / o}] a+b 



B(a+f ,i 



2jt / 



B(a,fe) 



exp(//if) 



(7) 



for a > 0, ft > 0, a > 0, }X e K. 

The z distribution can then be recognized as the special case the generalized-z (GZ) 
distribution, which has characteristic function 



0(0 



B(a + g ,1 



In / 



25 



B(a,fe) 



exp(/ i uf) 



for 5 > (Grigelionis 2001 ). This distribution has parameters (a,b,n,o,8) and can also 
be characterized by its Levy triple {A,0,jj.(x)dx}, where 



a< 5 p2n/a e -bx _ e -ax 



l-e~ 



K JO 



dx + jj. . 



(8) 



and 



f 25exp{^} 

*{l-exp(^)} ' ltX>U 



25exp{?fi} 
I W{l-exp(^)} 



if jc < . 



The characteristic function of a generalized-z distribution makes its self-similarity plain: 
if Zi ~ GZ(a,b,^i/p,a, 1/2/?), then 

!=1 

where z ~ Z(a,ft, i u,a). We thus have a self-similar representation, on the log-variance 
scale, of the normal/inverted-beta class. 

This result is of limited use except in special cases where the density of the generalized- 
z increments is known, which will not hold in general. Luckily the horseshoe prior, where 
a = b = 1/2, corresponds to just such a special case — as do all symmetric cases where 
k ~ Be(a, I— a) and Xf = k/{\ — k). 
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To see this, let z ~ Z(a, 1 — a,n, a) for a £ (0, 1). Then standard manipulations of the 
characteristic function ^ give 



0(0 



cos(c/2) 



expO'/i?) . 



cosh (^) 

where c = % (2a — 1). This is recognizable as the characteristic function of a Meixner 



process, z ~ Meix(a,c, l/2,ju) (Grigelionis 19991. The density and Levy measure of a 
Meixner random variable are 



Piz) 
H(dx) 



2cos(c/2) I ciz — jJ. 

exp 

an 

exp (ex/ a) 



r [ I + 

2 a 



2xsinh(7Tx/a 



Ax. 



(9) 
(10) 



For the horseshoe prior, a = 1 — a and therefore c = 0. 

A Meixner process is self-similar: if z.i ~ Meix{a,c, 1 / (2p),fi/p}, then 



Z; = z ~ Meix(a,c, 1/2, ju) . 



When a = 1 and }X = 0, then the random variable T = e z will have an IB (a, 1 — a) distribu- 
tion, as required. Therefore, the most intuitive way of passing to a limit under the horseshoe 
prior is to continue dividing the random variable T, on the log variance scale, into arbitrarily 
many self-similar increments. 

Interestingly, both the z-distribution and the Meixner can themselves be represented 
as mixtures of normals. The mixing distribution for the z is an infinite convolution of 



exponentials, a potentially interesting generalization of the lasso model (Barndorff -Nielsen 



et al.| 1982[ ). For the mixing distribution of the Meixner, see Mad an and Yor| ([2006). 



4 The general Levy-process case 

We have encountered two ways in a subordinator can be used to generate joint distributions 
for j6, or equivalently penalty functions: 

1. by subordinating Brownian motion to T(s), leading to a Levy process Z(s) whose 
increments are identified with the components of J3y. 

2. by using the subordinator's Laplace exponent vy(t) as a penalty function, which 
sometimes leads to a tractable mixture representation for the corresponding prior 
Pifii I v) °c exp{-viK/)}. 

In general Bayesians have focused on the finite-dimensional analogue of the first approach, 
while frequentists have focused on the second approach, although many authors have fo- 



cused on explicit translations of a classical estimator into a Bayesian model (e.g. Park and 
Casellal [20081 |Hansl |2009] >. 
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An encompassing formulation involving Levy processes is available. This is most easily 
understood in the case of an orthogonal design matrix X, in which case we define y = X'y. 
Let A = vp~\ and let 

&£ZC/A)-Z([/-1]A) 

for some arbitrary Levy process Z(s) having Levy measure n(dx), assumed to be defined 
over the interval [0, v]. Then upon observing y = (yi , . . . ,y p ) with yj ~ N(j3y, a 2 ), identify 
y with the increments of the interlacing process Y(s) =Z(s) + a p W(s): 

y 7 = F( 7 A)-y([j-l]A). 

The observations are themselves the increments a Levy process: a superposition of signals 
or jumps identified with Z(s), and noise identified with a scaled Wiener process W(s). 

The Bayesian local-shrinkage framework of Equations ([lJl-Q is to specify the distri- 
bution of the increment 8 = Z(jA) —Z([j — 1]A) as a Gaussian mixture. In general the 
corresponding Levy measure will not be known. Unless the mixing distribution belongs 
to some convolution-closed family (such as the gamma or inverse-Gaussian), we will not 
know the distribution of increments at other "time scales," and asymptotic analysis may be 
difficult. 

The above construction says, in effect, that one can proceed by specifying the Levy mea- 
sure directly, with the two subordinator-based approaches being intermediate cases. Indeed, 
by the Levy-Khinchine theorem, any model that preserves the conditional-independence 
property of the j8/s will fall into this framework, since any stationary cadlag process with 
independent increments is completely characterized by its Levy measure. 

By casting the finite-dimensional problem in terms of the marginal distributions of a 
suitable infinite-dimensional problem, the Levy process view provides an intuitive frame- 
work for asymptotic calculations. Such analysis can be done under one, or both, of two 
assumptions: that we observe the process longer, or that we observe it on an ever finer 
grid. Each scenario corresponds quite naturally to a different assumption about how the 
signal-to-noise ratio behaves asymptotically. 

Finally, it is possible to generalize these methods still further, following along the lines 



of the nonparametric function-estimation strategy proposed by Wolpert et al. (2010). These 
authors consider priors for kernel weights based on a stochastic integral of a generator func- 
tion with respect to a random measure, which allows for the incorporation of spatial marks, 



periodicities, and further covariates into the prior (see also |Clyde and Wolpert| 201 1| ). Since 



our interest is in priors for j3 that maintain exhangeability among the regression coefficients 
and thus correspond to traditional penalty functions, we do not pursue this approach here. 

5 The statistical relevance of the Levy-process view 
5.1 Levy processes and the two-groups model 

We now describe, in a more precise way, the result mentioned in the introduction: that 
Bayesian variable selection and pure-shrinkage solutions like the lasso can both be viewed 
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as special cases of the same encompassing framework. 

The familiar discrete mixture or "two-groups" model specifies that each j8 ; is either in 
or out of the model with some prior inclusion probability: 

Pj~wp(Pj) + (l-W)8o, 

where 80 is a Dirac measure. This is the typical assumption used in Bayesian model selec- 
tion, model averaging, and multiple testing (c.f. Geor ge and Foster[ 2000; Scott and Berger 



2010). 



The two-groups model arises as a special case of the Levy-process framework: namely, 
when Z(s) has a finite Levy measure and is therefore a compound Poisson process. Under 
this assumption, 

N(s) 

Y(s)= 1 £j i + aW(s), 
i=i 

where N(s) is a Poisson process with rate 6 governing the number of jumps that occur by 
time s, and each 7, is an independent draw from some jump distribution. 

With probability 1, a compound Poisson process will have a finite number of jumps on 
any finite interval. These jumps correspond to the nonzero signals in jS; all other increments 
of Z{s) will be zero. The Levy density of Z(s) describes the distribution of the signals, 
while the jump rate (which can be identified in terms of the total mass of the Levy density) 
describes their relative abundance in the cohort of /3/s under consideration. 

To illustrate the connection, suppose that /, ~ N(0, T] 2 ), and that we follow the previous 
line of reasoning by equating the regression coefficients /3y with the increments of Z(s) on 
a discrete grid of size A. Then with probability w = 1 — e~ 9A , fij will correspond to an 
interval where at least one jump has occurred. Moreover, each nonzero /$/ will arise from a 
normal distribution: 

fij ~ N(0,t 2 ) 
.2 f^l 2 



where wu = (k\)~ l (Ad) k exp(— Ad) is the probability of seeing k jumps. In essence, the 
missing k = term corresponds to the null hypothesis of no jumps, yielding j8y = 0. 

The discrete-mixture prior is an example of a finite-activity process where the total 
Levy measure is finite. But one could also use an infinite-activity process, where the Levy 
measure is merely sigma-finite. This would mean that the underlying process had an infinite 
number of very tiny jumps — in other words, that no j8 ; 's are zero, but that most are of 
insignificant size compared to a. The pure- shrinkage ("one-group") model and the two- 
groups model can therefore be subsumed into this single framework. 

An interesting question is: how different are the one-group and two-group models, 
asymptotically (i.e. as p — > 00, and therefore A — > 0)? Observe that under the two-groups 
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model where Z(s) is a compound Poisson process with jump density g, 



PQPjl >e)=AG ( g{x)&x + o{A) 
Ja(e) 



where H(e) = (— °°, e) U (e, oo). This decreases linearly in A, at a rate governed by the jump 
activity B of the Poisson process. 

Meanwhile, if Z{s) is instead a pure-jump Levy process with Levy measure /I (dx) , then 



P(\Pj\ >e)=A / n(dx)+o(A). 



Any Levy process necessarily assigns finite measure to the set Q.(e) for e > 0, so this 
probability also decreases linearly in A. In this sense, the class of priors derived from the 
increments of a Levy process encompasses those priors that can be made to asymptotically 
mimic the one-group model in terms of the measure they assign to for any e > 0. An 
interesting comparison is with the work of Berger and Delampady (1987]), especially their 



discussion concerning the validity of approximating interval nulls by point nulls. 
5.2 A representation of the posterior mean 

Much of the research on penalized-likelihood estimation concerns methods for finding 
sparse posterior modes in high-dimensional regression problems. Yet the posterior mean 
is the estimator that minimizes posterior expected loss under the squared-error loss func- 



tion, and can lead to improved predictions compared to the posterior mode (c.f. Efron 



2009). It is therefore interesting to compare the behavior of the posterior mean estimator 
under different joint distributions for j8. 

We again consider the orthogonal-design case, or the exchangeable normal-means prob- 



lem. Recall the following result from Pericchi and Smith ( 1992). If p(y — j8) is a normal 
likelihood of known variance a 2 , p(j3) is the prior for j3 (subject to some mild regularity 
conditions), and m(y) = J p(y — j8)/?(j6) d/3 is the predictive density for y, then: 

E(j3 \y)=y + a 2 ^-\nm(y). (11) 
dy 

This result is useful for the insight it gives about an estimator's behavior in situations 
where y is very different from the prior mean. In particular, it shows that "Bayesian robust- 
ness" may be achieved by choosing a prior for /3 such that the derivative of the log predictive 
density is bounded as a function of y. Models meeting the slightly stronger condition that 
{E(p | y) — y] — > for large \y\ are said to have redescending score functions. 

We generalize this result as follows. 

Theorem 2. Let p{\y — j8 1) be a likelihood that is symmetric in y — j8. Let y{t) be a penalty 
function satisfying the conditions ofTheorem^fort = fi 2 /2, andforwhich the correspond- 
ing subordinator T = T(s), s > 0, has a prior p(T) satisfying E{T~ 1 } < oo. Define the 
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following size-biased pseudo-density and corresponding marginals: 



Then 



P*(T) 
m*(y) 



E(P\y)=E(T- 



T~ l p(T) 
E(T) 

poo 

/ e- T Pl 2 p*{T)dT 
Jo 

p(y-P)p*(P)dp. 



m*{y) d 
m(y) dy 



\nm*(y) . 



(12) 



1975 



Special cases of this theorem have appeared repeatedly in the literature; c.f. Masreliez 
k, |Polson| (1991) , |Mitchell| ( fT994| >, |Carvalho et aL] (jgUTUb, and |Griffin and Brown 



(2010). These results have been used to characterize "good" mixing distributions p(A?) in 
the traditional global-local shrinkage model. The important insight is that the sparse signal- 
detection problem is essentially the same as the outlier-sensitivity problem, a classic topic 
of interest in robust Bayesian statistics. 

Our result extends this long line of research to provide a more general expression for the 
posterior mean. It uses the subordinator representation to characterize the posterior mean 
corresponding to any penalty function meeting the regularity conditions of Theorem [T] The 
intuition is essentially that, whenever a prior is chosen such that m*(y) has a small deriva- 
tive in a large neighborhood of the origin, the posterior mean will strongly shrink small 
observations to 0. The result also directly describes an estimator's sensitivity to aberrant 
observations — that is, signals — in terms of the corresponding Levy measure, rather than the 
prior for the local-shrinkage parameter Xj. 

Extending the general approach to non-orthogonal designs is straightforward, but alge- 



braically involved. It follows closely the method of proof pursued by Masreliez ( 1975 ) and 
IGriffin and Brown| ( [2"0T0"l ). 



5.3 Finding sparse solutions via the posterior mode 

Using Theorem [TJ we can also develop simple EM algorithms for estimating the posterior 
mode of /3 under a wide variety of models. 

First, one may express a wide variety of problems as mixtures of ridge regressions, 
following along the lines of Caron and Doucet ( [2008 ) and Armagan et al. ( 2010[ ). If we 
take f(Pj) = jfij as in the previous theorem, then a similar line of reasoning leads to an 
algorithm for finding the mode, rather than the mean. Under the conditions of theorem 1 , 
suppose we have 

-vyr(fSj/2) __ 



(13) 



Given a set of augmentation variables {7}}, the conditional log-posterior distribution be- 
comes 

i=l 7=1 
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where /, is the log-likelihood contribution associated with observation i. For a normal like- 
lihood, this will be the log density of a normal posterior whose mode is the generalized 
ridge estimator 

P = (X'X + v 2 T)X'y, 

where T = diag(7\ ,...,T p ). 

This provides the M step. Moreover, since the complete-data log likelihood is linear in 
Tj, its expected value given a current estimate /3 ^ is 

Q(P) = im-t E (Tj\Pj s) )pj/2- 

(=1 j=\ 

This expectation can be computed by differentiating ( fT3] l under the integral sign to give 

V^(jS?/2) 



E(7) | fij) 



Plugging in the current estimate j3y gives the E step. 

A second algorithm motivated by Theorem [T] generalizes the LLA approach of Zou and 



Li (20081. Suppose we take /(/$/) = |j3/|. Then if y(|j3;|) meets the conditions of the 



theorem, it is the log-moment generating function of a subordinator 7) = T(y), and 

poo 

exp{-v V (|&|)} = / e ' il i>(Tj)dTj 
Jo 

poo 

-V ¥ (\Pj\)=\og e- T ^p(Tj)dTj. 
Jo 

Recall that the time v at which the subordinator is observed corresponds to the global 
regularization parameter, assumed to be given. Taking derivatives with respect to fij inside 
the integral sign gives us the identity 

sign(l3j). VV '(\l5j\)=E(Tj\l5j). 

Here the expectation is with respect to the conditional posterior 

p(Tj\Pj)cc e - T iWp(Tj). 

Moreover, observe that the complete-data log-likelihood using Tj as an augmentation 
variable takes a simple form: 

This expression in linear in Tj, which suggests a simple EM algorithm. Suppose we have a 
current estimate f3j g \ For the E-step, we take the conditional expectation of the likelihood 
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/(/?) with respect to p(Tj\f}j) to obtain the objective function 

i=l 7=1 

where E (7) \ j3j) = sign(fij) • vy/djS/l). This is the usual convex optimization problem 
encountered in rinding a lasso solution, meaning that the M-step can be solved painlessly, 
using standard methods. Lasso is already in this form without the need for augmentation 
variables, but other models representable as mixtures of double-exponentials are just as 
simple to fit. 

An illuminating comparison is with the local-linear-approximation algorithm (LLA) of 
Zou and Li (2008), specifically Equations 2.7 and 2.10. Specifically, our Theorem [T] gen- 



eralizes 2.10 to cases beyond Laplace transforms of double-exponentials, and provides a 
probabilistic interpretation for all penalty functions in the class by expressing the corre- 
sponding Bayesian scale-mixture models in terms of an underlying Levy measure. This 
probabilistic interpretation also leads to the expressions for the posterior mean derived in 
the previous subsection. 



6 Regularized regression when p < n 

6.1 Connections among RR, PCR, PLS, and the g-prior 

Thusfar we have considered Levy processes for constructing high-dimensional joint prior 
distributions for regression coefficients in a manner than maintains the exchangeability of 
the j6/s. This nests the traditional local-shrinkage approach in (|TJ)-(j3]), considered by many 
authors. We now consider the mroe general case where the object of inferential interest is 
not necessarily /3 , but a set of linear combinations thereof — an approach that will generalize 
more easily to the p > n case. In particular we specify priors in the coordinate system 
defined by the principal components of X'X, although in principle other linear combinations 



follow the same template. This will illuminate connections among the work of Frank and 



Fried man| ( |1993| ), |West| ( |2003| ),|Maruyama and George|( p010| ), and ours on Levy processes. 



Let X = UDW' represent the singular- value decomposition of the design matrix X. If 
n > p, then X is of full column rank, and D = diag(<ii, . . . ,d p ) is a diagonal matrix of 
nonzero singular values ordered d\ > ■ ■ ■ > d p . Both U and W are orthogonal matrices, of 
dimensions n x p and p x p, respectively. Moreover, W is also the matrix of eigenvectors 
{wj} for the cross-product matrix S = X'X, with corresponding eigenvalues dj. 

The original regression relationship may be re-expressed in the orthogonalized space as 
y = Za + £, where Z = UD and a = W'^. The ordinary least-squares (OLS) estimate for 
a is a = (Z'Z)- l Z'y = D~ x U'y. 

Following Frank and Friedman ( 1993) ), the shrinkage structures for many common reg- 



ularization approaches can be understood by expanding their solutions in the original coor- 
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dinate system in terms of the eigenvectors {w\,. . . ,w p } and the OLS coefficients a: 



(14) 



Here M denotes the method, and the Kj 's are method-specific shrinkage weights that scale 
the OLS solution along each of the directions wj. 

Both ridge regression and principal-components regression use shrinkage weights that 
do not depend on the response values y. The ridge-regression solution is kJ r = dj/(v + dj) 
for a fixed regularization parameter v, while the ^-component PCR solution is 



r PCR 



1, 4>4 

0, d)<d\ 



The posterior mean under the g -prior also fits in this shrinkage structure; it corresponds 
to Kj = g/(l + g), thereby shrinking the solution vector along all eigen-directions by a 
common factor. 

The shrinkage weights under partial least squares, on the other hand, depend nonlinearly 



upon the response values y through the OLS solution a. Using the expressions in Frank and 



Friedman ( 1993 1, for the ^-component solution we have 



r PLS 



where 6 = {9\, ... , 6k}' is equal to W Tj, with 



ri k = ± a]df k+l KndW kl = t &jdf +I+l \ 



7=1 



7=1 



6.2 A Bayesian interpretation 

These four procedures differ only in the way that they scale the OLS estimates for the 
regression parameter in the orthogonal coordinate system defined by W. It is therefore 
natural to consider them as special cases of an encompassing local-shrinkage model along 
the lines of the previous sections. 

Begin with the g-prior, an explicitly Bayesian model wherein j3 ~ N{0, G 2 g(X'X)~ 1 } 
a priori, or equivalently a ~ N(0, o 2 gD~ 2 ). This prior biases the direction of a along the 
axes of the principal-component coordinate system. 

Ridge regression also has a well-known Bayesian interpretation as the posterior mean 
under the conjugate normal prior j8 ~ N(0,a 2 T 2 /), where the global variance T 2 = 1/v. 
This prior is agnostic with respect to the orientation of the regression vector, depending 
only upon its Euclidean norm. 

These procedures, along with PCR, are all special cases of a more general prior: 

(a | a 2 , t 2 , A) ~ N(0, a 2 T 2 A) , (15) 



18 



where T 2 is a global variance component and A = (A 2 ,..., A 2 ) is a diagonal matrix of 
local variance components. The posterior distribution of a under this prior is conditionally 
normal, with mean 

„ ( z 2 Xjd) \ 

with the a/s being mutually independent given z , a , and the data. 

The classical g-prior therefore corresponds to T 2 = g and Xj = dj 2 . Ridge regression 
corresponds to A 2 = 1 . And PCR corresponds to 

xi={ ? i^i 



'i I 0, d 2 <d\ 

for the ^-component solution. 

Rather than estimating a under fixed choices of the local variances Xj, the natural fully 
Bayesian approach is to use the shrinkage weights 

FB ( T ' A M "\ 

■9 = E ^\x*)[T^xjdj)> (16) 

where the expectation is over the posterior distribution of local and global variance compo- 
nents. 

Different choices for the priors p(Xj) and p(z 2 ) can center the Bayesian model at dif- 
ferent classical regularization approaches, while still allowing the data to dictate otherwise. 
Choosing p(Xj) to concentrate near 1, for example, will center the model near the classical 
ridge solution. On the other hand, if Xj = dj 2 v 2 , then choosing p(v 2 ) to concentrate near 
1 will center the model near the g-prior. Placing a further prior on x 2 will replicate the 



mixtures of g -priors studied by|Liang et al. (2008 



Mixing over a further prior p(A), however, will lead to even more flexible mixtures of 
g-priors. In particular, the classical g-prior prefers coefficient vectors that line up with the 
principal components, and further mixing over local variance components helps to robustify 
the model against this assumption. 

Even the PCR solution can be chosen as an approximate centering model by selecting 
a prior p(Xj) such that p(Kj) concentrates simultaneously near and 1. For example, if 
T 2 = 1 and Xj follows an inverted-beta (or "beta-prime") distribution IB(l/2, 1 /2), then Kj 
will have a Be(l /2, 1 /2) prior, whose density function is unbounded both at and at 1 as 



required. Marginally this leads to a horseshoe prior for (Xj (Carvalho et al. 20101. 

Partial least squares, on the other hand, cannot be interpreted in this framework. To see 
this, observe that the shrinkage weights are identified with the prior variance components 
via Kj = z 2 Xjd 2 /(l + z 2 Xjd 2 ). Under PLS, some of the shrinkage weights K*jjf may be 
larger than 1. Such weights cannot arise from a valid (non-negative) configuration of A 2 's 
and T 2 . Therefore, PLS cannot be the optimal solution under any prior expressible as a 
global-local scale mixture of normals. 
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6.3 When should the full Bayesian model work better? Some intuition and 
examples 

Ridge regression, PCR, and PLS are all operationally similar. They bias the coefficient vec- 
tor away from directions in which the predictors have low sampling variance — or equiva- 
lently, away from the "least important" principal components of X. This leads to a favorable 
bias-variance tradeoff in the performance of the resulting estimator. The g-prior and mix- 
tures of g-priors, on the other hand, shrink along all eigen-directions equally, and usually 
not by very much. 

Neither of these approaches need work well. When the underlying regression signal is 
"eigen-sparse" — that is, when only some of the linear combinations of j8/s given by W are 
meaningful for predicting y — then one should shrink different components of a by different 
amounts. This makes the g-prior inappropriate. 

Yet as many previous authors have noted, there is no logical reason that y cannot be 
strongly associated with the low-variance principal components of X. Ridge regression and 
PCR will both do poorly in these situations: RR will necessarily shrink more along low- 
variance directions, while PCR must include all the higher-variance directions (J < K) in 
order to include a lower-variance one (K). 



The intuition behind the fully Bayes model of (15) is that the shrinkage weights Kj 
should indeed be unequal, but that they can be learned from the data, and need not be 
monotonic in dj . The fully Bayes shrinkage weights, moreover, will depend not merely on 
X. They will also depend nonlinearly upon y, and upon each other through their mutual 
dependence upon T 2 . 

Consider three illustrative examples. Although there are many options to explore using 
the results of previous sections, in all cases we have assumed for the sake of illustration that 
T 2 ~ IB(l/2, 1/2) and that Xj ~ IB(l/2, 1/2), thereby specifying a geometric-Meixner- 
process prior for a (see Section [33]>. 



First, we analyzed the data from Fearn (19831, consisting of 24 samples of ground 



wheat. The response variable is the protein concentration in the wheat, while the predictors 
(L1-L6) are measurements of the samples' reflection of NIR radiation (R), measured at six 
different wavelengths between 1680 and 2310 nanometers. The predictors are referred to 
as "log values", since they are measured on a log(l//?) scale. The goal is to find a linear 
combination of log values that predicts protein concentration. Both the response and the 
predictors were centered and rescaled to have variance 1 . 

The log values are highly multi-collinear, with the smallest pairwise correlation being 
0.925. Despite the fact that ridge regression is intended for just these multi-collinear situ- 
ations, here it performs quite poorly. As Fearn (1983]) explains, this happens because the 



first principal component places nearly equal weight on all six log values (see Table[2]). The 
variation described by this component — essentially the sample average of the log values — is 
due mainly to differences in particle size. It carries little information about protein content, 
and yet is prefentially treated as the "most important" predictor by the ridge estimator. Con- 
trasting log values are associated with "less important" principal components, and yet these 
contrasts — mostly the second, third, and fourth — are far more useful for predicting pro- 
tein concentration. Ridge regression shrinks these components more aggressively than the 
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Table 2: The six principal component variances and loadings for the wheat protein- 
concentration data. 





PCI 


PC2 


PC3 


PC4 


PC5 


PC 6 


LI 


0.411 


0.213 


0.265 


-0.353 


0.422 


0.642 


L2 


0.410 


0.342 


-0.446 


-0.079 


0.465 


-0.542 


L3 


0.411 


0.266 


-0.367 


-0.209 


-0.743 


0.173 


L4 


0.411 


-0.028 


0.731 


-0.127 


-0.221 


-0.481 


L5 


0.396 


-0.874 


-0.242 


-0.126 


0.067 


0.023 


L6 


0.411 


0.05 


0.05 


0.891 


0.013 


0.182 


Variance 


5.868 


0.101 


0.019 


0.012 


< 0.001 


< 0.001 



other methods. Also observe the large amount of uncertainty surrounding the higher-order 
shrinkage factors. 

Second, we analyzed data on the softening temperature (y) of n = 99 ash samples orig- 
inating from different biological sources. The predictor matrix comprises p = 16 observed 
mass concentrations for the ash samples' constituent molecules. The measurements are 
highly multi-collinear, with the eigenvalues of the correlation matrix for X spanning 10 or- 
ders of magnitude. The data are available in the R package chemometrics, and have been 
centered and scaled. 

Finally, we analyzed synthetic data where X corresponds to a factor model. That is, 
each row xf t satisfies 

Xi = Bf i + $ i , 

where the loadings matrix B is p x k, f, : ~ N(0,7) is k x 1, ^ ~ N(0, y/7) is p x 1, and k < p. 
The predictors that arise from this structure will exhibit multi-collinearity, and when \j/ is 
small compared to the entries in B, this multi-collinearity will be very pronounced. In a 
factor model, moreover, it need not be the case that y will be associated most strongly with 
the high- variance principal components of X. 

We generated data where p = 20, n = 100, k = 5, and \jf = 0.1, with all the entries of 
B set to 1. The resulting coefficient vector, least-squares estimate, and eigenvalues D are 
excerpted in Table [3] Principal component 12 is clearly the outlier: it is a strong predictor 
of y, and yet its corresponding variance is two orders of magnitude smaller than the largest 
variance. 

Figure [T] compares the shrinkage structures of RR, PCR, PLS, and the Bayesian model 
for all three of these data sets. The components are ordered left to right along the x axis from 
highest variance (1) to lowest variance (p), while the shrinkage coefficients K (Equation 14 1 
are along the y axis. The tuning parameters for the non-Bayesian methods were chosen by 
cross-validation. 

In all three cases, there appears to be a tendency for both PCR and ridge regression to 
over-shrink coefficients corresponding to low-variance eigen-directions. On the ash data 
set, components 7 and 9 seem to be important, while for the factor model, component 12 is 
known to be the most important. Yet all are shrunk nearly to zero by RR and PCR. For the 
sake of variance reduction, too much bias is introduced. 
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Wheat data: shrinkage coefficients 
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Ash data: shrinkage coefficients 
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Synthetic factor data: shrinkage coefficients 
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Figure 1 : Comparison on three data sets in terms of how much the four methods shrink 
each principal component. Grey dots (grey lines): posterior means (75% credible intervals) 
under the fully Bayesian model. Blue dashes: ridge regression. Red dots: partial least 
squares. Black dots and dashes: principal-component regression. 



22 



Table 3: Subset of the true orthgonalized coefficient vector, least-squares estimate a, and 
eigenvalues for Example 3, where X is a five-factor model. 



Comp. 

r 

2 



-0.10 
-0.02 



-0.11 
-0.50 



D 
91.83 
1.41 



11 


0.42 


1.36 


0.98 


12 


12.10 


12.16 


0.91 


13 


0.04 


0.13 


0.85 


19 


0.39 


-1.35 


0.60 


20 


0.00 


-1.87 


0.58 



Partial least-squares, on the other hand, can identify important low-variance compo- 
nents. Yet it does so by including many other unimportant low-variance components. For 
the sake of bias reduction, too much variance is introduced. 

The fully Bayesian model seems to blend the best of both these techniques. It can 
successfully pick out important coefficients corresponding to low- variance eigen-directions. 
Yet at the same time, it can squelch the other unimportant components. Intuitively, this 
combination should make for a favoriable bias-variance tradeoff in larger problems. 



7 Regression when p > n 



7.1 Generalization to large-p cases 

Suppose now that the design matrix X is of rank r < p and has singular- value decomposition 
X = UDW' with D = diag(cii ,...,d r ), again ordered from largest {d\) to smallest (d r ). The 
approach of the previous section works just as before, with no essential modification: 



(a | a, a 

r 2 ,2 



(a | a ,t ,A) 



N(a,a 2 D~ 2 ) 
N(0,aVA) 

P&}) 
P(a 2 ,r 2 ), 



where a = W'p and a is the corresponding OLS estimate. Instead of a /^-dimensional 
vector to estimate, we now have an r-dimensional one. Moreover, because we have orthog- 
onalized the coefficients, the elements of a are conditionally independent in the posterior 
distribution, given T 2 and a 2 . We are faced with a simple normal-means problem, with the 
only complication being that the singular values dj enter the likelihood. 



This approach is also related to the work of Maruyama and George (2010), who propose 
a modification of the standard g-prior (Zellner, 1986) for use in Bayesian variable selection 
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when p > n. Suppose that 



p{P) = Y[p j (w'fi\g,o 2 

7=1 



Each pj(w'S | g, a 2 ) is a normal density, 



^WjP\o,^fj(i+i 




(17) 



where wj is the 7th right-singular vector of X, and where /,• > 1 is necessary to ensure 
positive defrniteness. 

The seemingly strange form of ( 17 1 harks back to Strawderman ( 1971[ ). Structurally, it 
essentially the same prior considered above, with a slight modification made for the sake of 
ensuring that the marginal distribution p(y) is analytically convenient (see Section 4.7.10 
of Be rger|[T985[). M aruyama and George recommend mixing over a prior for g while fixing 
fj = dj/d 2 in yTv- This approximately corresponds to a similar fixed choice for the A?'s 
in ([T5]>. 

Under this prior, there exists a closed-form expression for the Bayes factor between 
any two submodels of the full /^-variable model. This allows one to perform full Bayesian 
model selection even when p > n. 

Our proposal is an alternative generalization appropriate for pure shrinkage solutions, 
one that incorporates additional mixing over local variances X 2 . If we treat W as the canon- 
ical pseudo-inverse that maps back to the original coordinate system, then the implied prior 
for j6 = Wot is a singular normal distribution: 



03 I A, t , a 2 ) ~ N(0, a 2 r 2 WAW') . 



To see the connection with the g-prior more explicitly, suppose that Xj = d- 2 and that n> p, 
such that X is of full column rank. It is easily verified that WD~ 2 W' = (X'X)^ 1 , leading to 
the original g-prior with g = z 2 . Other authors have considered the same generalization, but 
with simple conjugate priors for A? — for example, Clyde et aT| ( 1996 1, Denison and George 
(2000), and West ( 2003| ). Our approach differs in our emphasis placed upon the choice of 
prior for A?, for which the developments earlier in the paper are clearly relevant. 

Under this model, the (conditional) posterior mean estimator for ctj is, just as before, 
given by 

/ T 



1 



a 



j > 



a generalized Bayesian version of the classic ridge estimator. 



7.2 Assessing out-of-sample predictive performance 

In the following simulation studies, we investigate the performance of the Bayesian model 
proposed above. We use the horseshoe prior, whereby T and each Xj receive independent 
half-Cauchy priors. We now sketch a brief rationale for this choice. Intuitively, the vectors 
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{wj} can be thought of as contrasts. A nice "default" Bayesian model would express the 
prior belief that certain contrasts of the j3 sequence will be strong predictors of y, and that 
some will be weak predictors. The horseshoe prior does just this: it will shrink most a/s 
very strongly, as the posterior mass for X tends to concentrate near zero. Yet it will leave 
unshrunk those a/s corresponding to contrasts that predict y well — even, it is to be hoped, 
those that correspond to a low-variance principal components — since the heavy tails of the 
half-Cauchy prior will allow certain A/s to be quite large. 

As test cases, we used the following 7 data sets, all of which had more predictors than 
observations. Only 1 of the 7 data sets is simulated; the other 6 are from chemometrics 
or genomics. All are available upon request from the authors, and the 6 real data sets are 
available from the R packages pis, chemometrics, and mixOmics. 

factor: the only simulated data set considered. Both X and y were generated jointly from a 
standard Bayesian factor model, with y loading most heavily on the lowest-variance 
factors. 

nutrimouse: observations of 40 mice where hepatic fatty-acid concentrations are regressed 
upon the expression of 120 potentially relevant genes measured in liver cells. 

cereal: chemometric observations of 15 cereal molecules where starch content is regressed 
upon NIR spectra at 145 different wavelengths. 

yarn: samples of 28 polyethylene terephthalate (PET) yarns, where the density of the yarn 
sample is regressed upon measurements of NIR spectra at 268 wavelenths. 

gasoline: octane numbers of 60 gasoline samples along with NIR spectra at 401 wave- 
lengths. 

multidrug: the X matrix comprises observations of the activity of 853 drugs on 60 different 
human cell lines, expressed as the concentration at which each drug leads to a 50% 
inhibition of growth for each cell line. The y variable is the measured expression of 
ABC3A (an ATP-binding cassette transporter) in each cell line. 

liver: the X matrix contains the expression scores for 3 1 16 genes in 64 rat subjects. The y 
variable is the cholesterol concetration in the liver. 

We compare the Bayesian model to the three basic techniques (partial least squares, 
ridge regression, and principal-components regression), along with a new technique called 
sparse partial least squares (Chun and Keles 2010[ ) aimed at simultaneous dimension re- 



duction and variable selection. This final method is implemented in the R package spls. 

To test these five methods, we split each of the seven data sets into training and test 
samples, with 75% of the observations used for training. We then fit each model using 
the training data, with tuning parameters for the non-Bayesian methods chosen by ten-fold 
cross validation on the training data alone. We then compared out-of-sample predictive 
performance on the holdout data, measured by sum of squared prediction errors (SSE). In 
each case the y variable was centered, and the X variables were centered and scaled. 
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Table 4: Average out-of-sample predictive error (SSE) on 50 different train/test splits for 
7 data sets where p > n. Bayes: the local-shrinkage model with horseshoe priors. PLS: 
partial least squares. PCR: principal-components regression. RR: ridge regression. SPLS: 
sparse partial least squares. The smallest entry in each row is in boldface. 

Average out-of-sample error 



Data set 


n 


P 


Bayes 


PLS 


PCR 


RR 


SPLS 


factor 


50 


100 


45.8 


66.9 


69.2 


358 


97.6 


nutrimouse 


40 


120 


394 


428 


467 


394 


462 


cereal 


15 


145 


45.2 


46.9 


46.3 


42.2 


46.5 


yarn 


28 


268 


2.63 


6.89 


20.2 


4.18 


53.8 


gasoline 


60 


401 


0.82 


0.87 


0.93 


0.72 


1.04 


multidrug 


60 


853 


139 


152 


173 


143 


160 


liver 


64 


3116 


1340 


1457 


1475 


1407 


1470 



All of our results in Table [4] represent the average SSE incurred over 50 different 
train/test splits. There are several interesting things to notice here. For one thing, the Bayes 
method seems to be the overall winner. It was the outright best on 4 data sets, tied for best 
on 1 data set, and second-best on the other two data sets. Surprisingly, the next-best method 
seems to be a venerable classic: ridge regression. The newest method, sparse partial least 
squares, was either worst or second-worst on all 7 data sets. 

The two cases where the Bayesian method offered the biggest improvements — the fac- 
tor data and the yarn data — are also instructive. In these cases, the y variable was most 
strongly associated with smaller-variance contrasts wj, or in other words, those contrasts 
associated with smaller singular values dj. Much as we saw in the previous section, classic 
methods like ridge regression and PCR perform poorly when this is the case, whereas the 
Bayesian model is quite robust. 

In other cases (notably the cereal, gasoline, and nutrimouse data sets), the signal-to- 
noise ratio seems to be either so favorable, or so poor, that all the methods do almost equally 
well. This suggests that the extra variance induced by mixing over local A?'s does not pose 
difficulty for the Bayesian model. 

8 Final Remarks 

The study of oracle properties provides a unifying framework in the classical literature for 
the study of regularized regression, but no such framework exists for Bayesians. In this 
paper, we have offered a few elements that might form the beginnings of such a framework. 
By identifying /3 (or a) with the increments of a discretely observed Levy process, we have 
embedded the finite-dimensional problem in a suitable infinite-dimensional generalization. 
This provides a natural setting in which the dimension p grows without bound. In par- 
ticular, Theorem [T] establishes mappings among Levy processes, penalty functions, priors, 
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and scale mixtures of well-known distributions. This offers a convenient way of generat- 
ing infinitely divisible probability distributions with known probabilistic structure, giving 
Bayesian statisticians a much larger toolbox for building shrinkage models like the kind 
explored in Section [7] 

A Proofs of main results 
Proof of Theorem Q] 

For Part A, since y(t) is totally monotone, it has derivatives of all orders and satisfies 

(-iyyW(f) <o. 

Furthermore, since lim f ^o V(0 = 0> then by Bernstein's theorem \jf(t) corresponds to the 



Laplace exponent of some subordinator T(s) (see, e.g., Cont and Tankov| 2004 Chapter 
4). That is, there exists a subordinator T (s) with Levy measure fl(dx) whose moment- 
generating function can be written as 

M T {t) = E{exp[-tT(s)]} = exp{-^(0} , (18) 

where y(t) is called the Laplace exponent and is given by its Levy representation in Equa- 
tion ([5]>. 

We recognize the mixture-of-normals representation in Part B as follows. Write the 
expectation in ( 18 1, evaluated at t = j3?/2, as 

E{expHr,} = J exp{-j3?7;/2} p(T s )dT s 

= £y/T,exp{-p}T s /2} {T- l/2 p(T s )} dT Sl 
where p(T s ) is the marginal density of the subordinator T observed at time s. The expression 

— 1/2 

T s p(T s ) is thus clearly proportional to a prior density for the precision T in a Gaussian 
mixture for f5j. 

This gives an explicit representation of the mixing density as the power-tilted density 
of the subordinator when a = 2. 

Proof of Theorem |2] 

By definition, p(fi) = Jq e~ T ^ g(T)dT . Therefore 

m{y) = Jp(y-P)J e- T -g(T)dTdf5. 
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The posterior mean is given by 

my) = W)I p(y-W e ~ T ^s(T)dTdp 

Using integration by parts yields 

my) = E ^wS Y," {y -^ T " 4s ' {T)dTdli 
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